
Shared: Add YEAST desugaring library#21797

Merged
tausbn merged 9 commits into main from tausbn/yeast-desugaring-tool on May 7, 2026

Conversation

@tausbn
Contributor

@tausbn tausbn commented May 5, 2026

This PR adds a cleaned-up prototype of the YEAST library that was developed in a hackathon a few years ago.

YEAST is intended to be a lightweight layer for performing various kinds of AST cleanup and desugaring directly on the parse tree produced by a tree-sitter parser. Rewrite rules are specified declaratively, with a query language that approximates that of tree-sitter, though notably with no alternation or anchors (and also with greedy semantics -- no backtracking). I expect that this will be sufficient for most uses.

Output templates also look like tree-sitter trees, with embedded Rust blocks for specifying code that calculates an AST based on the given input.

Because the output AST may be an entirely different language from the input AST, this PR also adds a new node-types.yml format -- a lightweight reformulation of node-types.json intended for human consumption (unlike the latter).

Of note: the output format disallows having field-less child nodes. The node-types.yml format supports them, but YEAST itself will silently throw them away.


There's a lot of code in this PR, but it's just a prototype, so don't feel compelled to review it in detail.

DO, however, look at the documentation, and also the changes to the existing tree-sitter extractor infrastructure (the final two commits).

@tausbn tausbn force-pushed the tausbn/yeast-desugaring-tool branch 3 times, most recently from 0cb1793 to 33485ca Compare May 5, 2026 12:50
@tausbn tausbn added the no-change-note-required label May 5, 2026
@tausbn
Contributor Author

tausbn commented May 5, 2026

CI failure seems to be unrelated to my changes.

@tausbn tausbn marked this pull request as ready for review May 5, 2026 14:38
Copilot AI review requested due to automatic review settings May 5, 2026 14:38
@tausbn tausbn requested review from a team as code owners May 5, 2026 14:38
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces YEAST, a Rust library for declarative AST cleanup/desugaring on top of tree-sitter parse trees, and integrates it into the shared tree-sitter extractor so extraction can optionally run on a rewritten AST and/or validate against an alternate output node-types schema.

Changes:

  • Add new shared/yeast + shared/yeast-macros crates implementing the rule/query/template system, plus tests and documentation.
  • Extend the shared tree-sitter extractor to optionally run YEAST rules and to support separate output_node_types for schema generation and TRAP validation.
  • Update Bazel vendored Rust deps to include YEAST dependencies (and bump tree-sitter, cc, etc.).
| File | Description |
| --- | --- |
| `shared/yeast/tests/test.rs` | End-to-end tests for parsing, query matching, tree building, and desugaring rules. |
| `shared/yeast/tests/node-types.yml` | Test output schema in the new YAML node-types format. |
| `shared/yeast/src/visitor.rs` | Converts a tree-sitter Tree into a YEAST Ast. |
| `shared/yeast/src/tree_builder.rs` | Fresh identifier generation support for templates/rules. |
| `shared/yeast/src/schema.rs` | Schema representation for kinds/fields (language-derived or YAML-derived). |
| `shared/yeast/src/range.rs` | Serde helpers for (de)serializing tree_sitter::Range. |
| `shared/yeast/src/query.rs` | Query AST and matching engine (captures, repetition, named/unnamed semantics). |
| `shared/yeast/src/print.rs` | Debug printer for walking a YEAST AST. |
| `shared/yeast/src/node_types_yaml.rs` | YAML ↔ JSON node-types conversion + schema construction from YAML. |
| `shared/yeast/src/lib.rs` | Core YEAST types (Ast, Node, Rule, Runner) and rewrite application logic. |
| `shared/yeast/src/dump.rs` | Human-readable AST dump utility used by tests. |
| `shared/yeast/src/cursor.rs` | Cursor trait abstraction used by traversal/extractor integration. |
| `shared/yeast/src/captures.rs` | Capture storage and utilities (single/repeated/optional). |
| `shared/yeast/src/build.rs` | BuildCtx used by tree!/trees! macros to build synthetic nodes. |
| `shared/yeast/src/bin/node_types_yaml.rs` | CLI tool to convert YAML node-types ↔ JSON node-types. |
| `shared/yeast/src/bin/main.rs` | Minimal YEAST CLI for parsing and printing. |
| `shared/yeast/doc/yeast.md` | Main YEAST documentation (architecture, query/template language, integration). |
| `shared/yeast/doc/node-types-yaml.md` | Specification for the YAML node-types format and CLI usage. |
| `shared/yeast/Cargo.toml` | New yeast crate manifest and dependencies. |
| `shared/yeast/Cargo.lock` | Lockfile for the standalone shared/yeast crate. |
| `shared/yeast/BUILD.bazel` | Bazel target for the yeast Rust library. |
| `shared/yeast/.gitkeep` | Placeholder file for directory tracking. |
| `shared/yeast/.gitignore` | Ignores shared/yeast/target. |
| `shared/yeast/.envrc` | Direnv config for local development. |
| `shared/yeast-macros/src/parse.rs` | Proc-macro parsing and codegen for query!, tree!, trees!, rule!. |
| `shared/yeast-macros/src/lib.rs` | Proc-macro entry points and user-facing macro docs. |
| `shared/yeast-macros/Cargo.toml` | New yeast-macros proc-macro crate manifest. |
| `shared/yeast-macros/BUILD.bazel` | Bazel target for the yeast-macros proc-macro crate. |
| `shared/tree-sitter-extractor/tests/multiple_languages.rs` | Updates tests to include output_node_types in LanguageSpec. |
| `shared/tree-sitter-extractor/tests/integration_test.rs` | Updates tests to include output_node_types in LanguageSpec. |
| `shared/tree-sitter-extractor/src/generator/mod.rs` | Generator uses output_node_types when provided. |
| `shared/tree-sitter-extractor/src/generator/language.rs` | Adds output_node_types to generator Language. |
| `shared/tree-sitter-extractor/src/extractor/simple.rs` | Uses output_node_types for schema validation in the simple extractor. |
| `shared/tree-sitter-extractor/src/extractor/mod.rs` | Adds optional YEAST desugaring path and AstNode abstraction. |
| `shared/tree-sitter-extractor/Cargo.toml` | Adds a path dependency on shared/yeast. |
| `shared/tree-sitter-extractor/BUILD.bazel` | Adds Bazel dep on //shared/yeast. |
| `ruby/extractor/src/generator.rs` | Populates output_node_types: None for Ruby/Erb generator languages. |
| `ruby/extractor/src/extractor.rs` | Updates shared extractor invocation with new extract(...) params. |
| `ql/extractor/src/generator.rs` | Populates output_node_types: None for QL generator languages. |
| `ql/extractor/src/extractor.rs` | Populates output_node_types: None for QL simple extractor languages. |
| `MODULE.bazel` | Adds/upgrades vendored crates (notably tree-sitter and new deps). |
| `misc/bazel/3rdparty/tree_sitter_extractors_deps/defs.bzl` | Adds YEAST + YEAST-macros crates and bumps vendored dependencies. |
| `misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.zstd-sys-2.0.16+zstd.1.5.7.bazel` | Updates cc dependency reference. |
| `misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.tree-sitter-ruby-0.23.1.bazel` | Updates cc dependency reference. |
| `misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.tree-sitter-ql-0.23.1.bazel` | Updates cc dependency reference. |
| `misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.tree-sitter-python-0.23.6.bazel` | Adds vendoring/build definitions for tree-sitter-python. |
| `misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.tree-sitter-json-0.24.8.bazel` | Updates cc dependency reference. |
| `misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.tree-sitter-embedded-template-0.25.0.bazel` | Updates cc dependency reference. |
| `misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.tree-sitter-0.26.8.bazel` | Bumps vendored tree-sitter to 0.26.8 and updates cc reference. |
| `misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.iana-time-zone-haiku-0.1.2.bazel` | Updates cc dependency reference. |
| `misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.find-msvc-tools-0.1.9.bazel` | Updates vendored find-msvc-tools version metadata. |
| `misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.cc-1.2.61.bazel` | Updates vendored cc version metadata and dependencies. |
| `misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.bazel` | Adds aliases for serde_yaml and tree-sitter-python; bumps tree-sitter alias. |
| `Cargo.toml` | Adds shared/yeast and shared/yeast-macros to the workspace. |
| `Cargo.lock` | Workspace lock updates (adds yeast crates; bumps tree-sitter, cc, etc.). |

Copilot's findings

Comments suppressed due to low confidence (1)

shared/yeast/src/node_types_yaml.rs:303

  • schema_from_yaml_with_language also registers YAML unnamed: tokens using schema.register_kind(name), which only affects the named kind map. If the YAML adds any unnamed tokens not present in the tree-sitter language, QueryNode::UnnamedNode lookups will still fail because unnamed_kind_ids is never updated.

This should use an unnamed-kind registration path (updating unnamed_kind_ids) rather than register_kind.

  • Files reviewed: 52/55 changed files
  • Comments generated: 6

Comment thread shared/yeast/src/node_types_yaml.rs
Comment thread shared/yeast/src/lib.rs Outdated
Comment thread shared/yeast/doc/yeast.md Outdated
Comment thread shared/yeast/src/print.rs Outdated
Comment thread shared/tree-sitter-extractor/src/extractor/mod.rs
Comment thread shared/yeast/src/schema.rs
@tausbn tausbn force-pushed the tausbn/yeast-desugaring-tool branch from 33485ca to fb1d844 Compare May 5, 2026 15:03
@tausbn tausbn requested a review from Copilot May 5, 2026 15:03
Contributor

Copilot AI left a comment


Copilot's findings

  • Files reviewed: 51/54 changed files
  • Comments generated: 3

Comment thread shared/tree-sitter-extractor/src/extractor/mod.rs Outdated
Comment thread shared/tree-sitter-extractor/src/extractor/mod.rs Outdated
Comment thread shared/yeast/.envrc Outdated
@tausbn tausbn force-pushed the tausbn/yeast-desugaring-tool branch from fb1d844 to cba9c08 Compare May 5, 2026 15:19
@tausbn tausbn marked this pull request as draft May 5, 2026 18:40
@tausbn tausbn force-pushed the tausbn/yeast-desugaring-tool branch 2 times, most recently from ed1ba0a to 1b8f451 Compare May 5, 2026 21:24
@tausbn tausbn requested a review from Copilot May 5, 2026 21:35
Contributor

Copilot AI left a comment


Copilot's findings

  • Files reviewed: 49/50 changed files
  • Comments generated: 2

Comment thread shared/tree-sitter-extractor/src/extractor/simple.rs Outdated
Comment thread shared/tree-sitter-extractor/src/extractor/mod.rs
@tausbn tausbn force-pushed the tausbn/yeast-desugaring-tool branch from 1b8f451 to e612319 Compare May 5, 2026 21:48
Comment thread shared/yeast/doc/yeast.md
Comment on lines +56 to +59
The `Runner` applies rules by walking the tree top-down. At each node, it
tries each rule in order. If a rule's query matches, the node is replaced by
the transform's output, and the rules are re-applied to the result. If no
rule matches, the node is kept and its children are processed recursively.
Contributor


How does this work when that input and output node-types are distinct? My thinking is:

  • If no rule applies, we can't just return the node as it is, because it has the wrong type (the node kind is not valid in the output node-types)
  • A rule cannot apply multiple times, because the output AST cannot generally be matched by a query that operates on the input AST.
    • If a given node name appears in both the input and output ASTs, but means something completely different in the two ASTs, will we accidentally re-apply rules on the output node? (misinterpreting it to be an input node of that kind)

Contributor Author


The short answer is: it doesn't. The behaviour described where unknown nodes are passed through without modification only applies to the case where the input language is a subset of the output language.

As for your second point, I think it should work to match against both input and output types -- I'll add a test to make sure. The original intent was that one could have two phases of desugaring: the first one turns the input AST into a nicer output AST (this is the part we care more about right now), the second desugars complex constructs into simpler ones (this is the part we focused on in the hackathon).

However, there is one subtlety that we might want to address: currently we rewrite the parent node before we descend into its children (and if no rewrite applies, we don't go back to it later). This makes actually implementing the second phase awkward -- when we see an output node we want to desugar, its children will still be in the input AST. It would probably be better to use a traversal where child nodes are handled before their parent. (However, for simple cleanup transformations, or more generally for transformations where we don't inspect the structure of the child nodes, it doesn't make a difference.)
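
To make the traversal-order point concrete, here is a minimal toy sketch in plain Rust. It does not use YEAST's actual types or API: `Tree`, the `in_*`/`out_*` kinds, and all function names are hypothetical and exist only for illustration. The "desugaring" rule inspects its children, so it can only fire once they are already output nodes.

```rust
/// Toy tree node (hypothetical; not YEAST's `Node`).
#[derive(Debug, Clone, PartialEq)]
struct Tree {
    kind: &'static str,
    children: Vec<Tree>,
}

fn leaf(kind: &'static str) -> Tree {
    Tree { kind, children: vec![] }
}

/// Applies the first matching toy rule, or returns None if no rule matches.
fn rewrite_once(node: &Tree) -> Option<Tree> {
    match node.kind {
        // Cleanup rules: translate input kinds into output kinds.
        "in_leaf" => Some(Tree { kind: "out_leaf", children: node.children.clone() }),
        "in_pair" => Some(Tree { kind: "out_pair", children: node.children.clone() }),
        // Desugaring rule: looks at the children, so they must already be output nodes.
        "out_pair"
            if !node.children.is_empty()
                && node.children.iter().all(|c| c.kind == "out_leaf") =>
        {
            Some(Tree { kind: "out_simple", children: node.children.clone() })
        }
        _ => None,
    }
}

/// Rewrites a single node until no rule matches (bounded as a safety net).
fn rewrite_fully(mut node: Tree) -> Tree {
    for _ in 0..100 {
        match rewrite_once(&node) {
            Some(next) => node = next,
            None => break,
        }
    }
    node
}

/// Parent first: rewrite the node, then descend. The parent is never revisited.
fn top_down(node: Tree) -> Tree {
    let mut node = rewrite_fully(node);
    let children = std::mem::take(&mut node.children);
    node.children = children.into_iter().map(top_down).collect();
    node
}

/// Children first: when the parent is rewritten, its children are output nodes.
fn bottom_up(mut node: Tree) -> Tree {
    let children = std::mem::take(&mut node.children);
    node.children = children.into_iter().map(bottom_up).collect();
    rewrite_fully(node)
}
```

In this toy model, an `in_pair` with two `in_leaf` children ends up as `out_pair` under `top_down` (the desugaring rule never sees output children), but as `out_simple` under `bottom_up`.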

Finally, if the same node type appears both in the input and output node types, then yes, we could accidentally re-apply rules to the output node. However, merely having the same node type isn't enough. The entire query has to match. Thus, if we do something like, say, change the name of a field, the query will stop matching, and we won't loop.

However, there is (at least) one case where we would loop. Consider the rule

(foo (_) @children) => (foo bar: {..children})

(that is, moving all unnamed children into a field). In this case, we would match repeatedly. The first time around we move all of the children into the bar field. The second time around we can still match, capturing an empty list of children in @children, and then overwrite the bar field, and so on. This continues until we hit the recursion depth limit (currently 100).
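
The looping behaviour can be reproduced with a small toy model in plain Rust. This is only a sketch under stated assumptions: `ToyNode`, `move_children_into_bar`, and `rewrite_until_limit` are hypothetical names, not part of YEAST, and the toy simply stops at the limit rather than modelling whatever YEAST does when the depth limit is hit.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for an AST node (not YEAST's real `Node` type).
#[derive(Debug, Clone, PartialEq)]
struct ToyNode {
    kind: String,
    /// Children that are not attached to any field.
    children: Vec<ToyNode>,
    /// Children attached to named fields.
    fields: BTreeMap<String, Vec<ToyNode>>,
}

impl ToyNode {
    fn new(kind: &str, children: Vec<ToyNode>) -> Self {
        ToyNode { kind: kind.to_string(), children, fields: BTreeMap::new() }
    }
}

/// Toy version of `(foo (_) @children) => (foo bar: {..children})`.
/// The capture may be empty, so the rule matches every `foo` node.
fn move_children_into_bar(node: &ToyNode) -> Option<ToyNode> {
    if node.kind != "foo" {
        return None;
    }
    let mut out = ToyNode::new("foo", Vec::new());
    // The output only sets `bar`, so any earlier `bar` content is overwritten.
    out.fields.insert("bar".to_string(), node.children.clone());
    Some(out)
}

/// Re-applies the rule to its own output until it stops matching or the
/// limit is reached. Returns the final node and the number of rewrites.
fn rewrite_until_limit(mut node: ToyNode, max_depth: usize) -> (ToyNode, usize) {
    let mut applications = 0;
    while applications < max_depth {
        match move_children_into_bar(&node) {
            Some(next) => {
                node = next;
                applications += 1;
            }
            None => break,
        }
    }
    (node, applications)
}
```

In this toy, a `foo` node with two field-less children has both of them in `bar` after one application, but with a limit of 100 the rule fires 100 times and `bar` ends up empty, because every later application captures an empty child list.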

To mitigate this, what we could do is enforce that a given rule is only applied once, either globally or on a rule-by-rule basis. For simple AST cleanup, I don't currently see any issues with enforcing this behaviour globally. For more advanced desugaring, it might be detrimental.

Contributor Author


a0a0e9e demonstrates that output->output transformations are indeed possible.

Contributor Author


I have a solution to the rule repetition issue. It essentially adds a .repeated() method to Rule, which simply sets a corresponding internal boolean value to true. If the boolean is false, the rule in question is not run again on a node where it was just applied. (Other rules are unaffected, even rules that would change the current node.)
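
As a rough illustration of the single-use default, the following toy runner in plain Rust models a rule with an opt-in `repeated` flag. Everything here (`ToyRule`, `run`, `prime_foo`, and the choice to track only the most recently fired rule) is a hypothetical sketch, not the actual change to YEAST's `Rule`.

```rust
/// Toy rewrite rule over string "kinds" (hypothetical; not YEAST's `Rule`).
struct ToyRule {
    name: &'static str,
    /// Returns the rewritten kind if the rule matches.
    apply: fn(&str) -> Option<String>,
    /// When false, the rule is skipped right after it fired on a node.
    repeated: bool,
}

impl ToyRule {
    fn new(name: &'static str, apply: fn(&str) -> Option<String>) -> Self {
        ToyRule { name, apply, repeated: false }
    }

    /// Opts in to re-running this rule on its own output.
    fn repeated(mut self) -> Self {
        self.repeated = true;
        self
    }
}

/// Tries rules in order on one node, restarting after each rewrite.
/// Returns the final kind and the names of the rules that fired.
fn run(rules: &[ToyRule], start: &str, max_depth: usize) -> (String, Vec<&'static str>) {
    let mut current = start.to_string();
    let mut fired = Vec::new();
    let mut last_fired: Option<usize> = None;
    'outer: while fired.len() < max_depth {
        for (i, rule) in rules.iter().enumerate() {
            // A single-use rule is skipped on the node it just produced.
            if !rule.repeated && last_fired == Some(i) {
                continue;
            }
            if let Some(next) = (rule.apply)(&current) {
                current = next;
                fired.push(rule.name);
                last_fired = Some(i);
                continue 'outer;
            }
        }
        // No rule matched: this node is done.
        break;
    }
    (current, fired)
}

/// Matches any kind starting with "foo" and appends a prime, so it always re-matches.
fn prime_foo(kind: &str) -> Option<String> {
    if kind.starts_with("foo") {
        Some(format!("{kind}'"))
    } else {
        None
    }
}
```

With the default (single-use) flag, `prime_foo` fires once and the runner stops; after `.repeated()` it keeps firing until the depth limit is reached.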

To me, single-use (per node) rules seem like the better default -- for most rules (especially cleanup rules) it is sufficient.

I also have a separate change that introduces a notion of "phases" to the runner. This makes it possible to specify separate "cleanup" and "desugar" phases.

I can add the former fix to this PR, if you prefer. Now that the CI is finally passing, I would rather just merge it and fix these issues in a separate PR (which, with some luck, won't have to run all the CI checks, unlike this one).

Contributor

@asgerf asgerf May 7, 2026


Happy to merge it like this and iterate on it from here.

For our current use-case, I'd expect the output types to diverge completely from the input types (no overlap except for coincidental name clashes). I think in practice we'll want one rule per node kind, possibly with a few node kinds being split across multiple rules. But there's no need to do everything in one PR.

tausbn and others added 8 commits May 6, 2026 11:34
YEAST (YEAST Elaborates Abstract Syntax Trees) is a framework for
transforming tree-sitter parse trees before CodeQL extraction.

Core components:
- shared/yeast/ — Ast, Node, Schema, query matching engine, captures,
  FreshScope, BuildCtx
- shared/yeast-macros/ — proc macros: query!, tree!, trees!, rule!

The query language is inspired by tree-sitter queries:
  (assignment left: (_) @lhs right: (_) @rhs)

Templates support embedded Rust ({expr}), splicing ({..expr}),
computed literals (#{expr}), and fresh identifiers ($name).

The rule! macro combines query and transform:
  rule!((for pattern: (_) @pat ...) => (call receiver: {val} ...))

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Human-friendly YAML alternative to tree-sitter node-types.json with
three sections: supertypes, named, unnamed. Supports bidirectional
conversion and building Schema objects from YAML.

Includes CLI binary (node_types_yaml) and documentation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Produces indented text showing node kinds, named fields, and leaf
content. Unnamed tokens are hidden unless inside a named field.
Used by tests for readable assertions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

12 tests covering parsing, queries, tree building, desugaring rules,
cursor navigation, and the shorthand rule! syntax.

Tests use a custom output node-types.yml with named fields for all
children (parameter, stmt, index), loaded via
schema_from_yaml_with_language.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Covers architecture, query language, template language (tree!/trees!/rule!),
capture semantics, fresh identifiers, and extractor integration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

extract() gains a rules parameter. When empty, uses tree-sitter native
traversal (no behavior change). When non-empty, runs yeast desugaring
and extracts via traverse_yeast.

Adds AstNode trait abstracting over tree_sitter::Node and yeast::Node,
with minimal changes to existing Visitor methods (Node -> &N in 6
signatures).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Language and LanguageSpec gain optional output_node_types field.
When set, the generator produces dbscheme/QL from the output types
and the extractor validates TRAP against them.

All existing extractors pass None (no behavior change).
Ruby extract() calls gain vec![] for the new rules parameter.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add BUILD.bazel files for the yeast and yeast-macros crates, register
them as dependencies of the shared tree-sitter extractor, and refresh
the vendored crate dependencies via update_tree_sitter_extractors_deps.sh.
@tausbn tausbn force-pushed the tausbn/yeast-desugaring-tool branch from e612319 to 60dcf88 Compare May 6, 2026 11:34
Adds a regression test verifying that desugaring rules can chain across
output-only node kinds: a first rule rewrites an input kind to an
output-only kind, and a second rule then rewrites that output-only
kind into another output-only kind. This exercises the schema lookup
for query patterns whose root kind is not present in the input
tree-sitter grammar.
@tausbn tausbn marked this pull request as ready for review May 6, 2026 14:51
@asgerf
Contributor

asgerf commented May 7, 2026

I've looked at all the code and it looks good to me, although I'm not exactly an expert Rust reviewer.

The PR seems to bump the tree-sitter version we depend on, which can affect production. I started a DCA run for Ruby to get a bit more validation on that.

Contributor

@asgerf asgerf left a comment


The DCA run came back clean

LGTM! 🚀

@tausbn tausbn merged commit 33fc767 into main May 7, 2026
144 checks passed
@tausbn tausbn deleted the tausbn/yeast-desugaring-tool branch May 7, 2026 11:48